1. Introduction
2. Problem Statement
3. Installing & Importing Libraries
4. Data Acquisition & Description
6. Exploratory Data Analysis
7. Data Post-Processing
Home Sweet Home Company Introduction
Your client for this project is an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities.
Current Scenario
The current process suffers from the following problems:
They have hired you as a data science consultant. They want to supplement their analysis and prediction with a more feasible and accurate approach.
Your Role
Project Deliverables
Evaluation Metric
# !pip install -q datascience # Package that is required by pandas profiling
# !pip install -q pandas-profiling
# !pip install -q --upgrade pandas-profiling
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd # Importing for panel data analysis
from pandas_profiling import ProfileReport # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np # Importing NumPy (Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.express as px
import matplotlib.pyplot as plt # Importing pyplot interface using matplotlib
import seaborn as sns # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.linear_model import LinearRegression # Importing Linear Regression model
from sklearn.metrics import mean_squared_error # To calculate the MSE of a regression model
from sklearn.metrics import mean_absolute_error # To calculate the MAE of a regression model
from sklearn.metrics import r2_score # To calculate the R-squared score of a regression model
from sklearn.model_selection import train_test_split # To split the data in training and testing part
from sklearn.preprocessing import StandardScaler # Importing Standard Scaler library from preprocessing
from sklearn.preprocessing import LabelEncoder # Importing Label Encoder library from preprocessing
#-------------------------------------------------------------------------------------------------------------------------------
import folium # Importing folium package
from folium import Map, Marker # Importing folium to plot locations on map
#-------------------------------------------------------------------------------------------------------------------------------
import warnings # Importing warning to disable runtime warnings
warnings.filterwarnings('ignore') # Suppress all warnings
HomeSweetHome = pd.read_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/train_data.csv')
print('Data Shape:', HomeSweetHome.shape)
HomeSweetHome.head()
Data Shape: (137023, 17)
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149653 | Private bedroom located in Downtown Manhattan | 257599351.0 | Sandra And Katharina | Manhattan | Chinatown | 40.71703 | -73.99538 | Private room | 2.0 | 17.0 | 18/09/19 | 1.04 | 1.0 | 0.0 | New York City | 100.0 |
| 1 | 74702 | Quiet, Comfy West LA Cottage - HSR 19-000047 | 2882551.0 | James | City of Los Angeles | Mar Vista | 34.01257 | -118.44254 | Entire home/apt | 2.0 | 331.0 | 07/09/20 | 4.28 | 1.0 | 170.0 | Los Angeles | 102.0 |
| 2 | 95858 | Home away from Home ! | 287662307.0 | Kahee | Other Cities | Pasadena | 34.14255 | -118.09888 | Entire home/apt | 10.0 | 4.0 | 29/03/20 | 0.37 | 1.0 | 306.0 | Los Angeles | 131.0 |
| 3 | 61301 | Kukui’ula Club Villa 11 | 198477445.0 | Lodge | Kauai | Koloa-Poipu | 21.88471 | -159.48359 | Entire home/apt | 1.0 | 0.0 | NaN | NaN | 19.0 | 358.0 | Hawaii | 2399.0 |
| 4 | 132101 | One bedroom apartment | 87835557.0 | Kostas | Queens | Astoria | 40.76623 | -73.90911 | Entire home/apt | 2.0 | 189.0 | 31/08/20 | 3.82 | 1.0 | 331.0 | New York City | 76.0 |
Dataset Feature Description
The Dataset contains the following columns:
| Column Name | Description |
|---|---|
| host_id | unique host Id |
| host_name | name of the host |
| neighbourhood_group | group in which the neighbourhood lies |
| neighbourhood | name of the neighbourhood |
| latitude | latitude of listing |
| longitude | longitude of listing |
| room_type | type of room |
| minimum_nights | minimum no. of nights required to book. |
| number_of_reviews | total number of reviews on the listing |
| last_review | the date on which listing received its last review |
| reviews_per_month | average reviews per month on listing |
| calculated_host_listings_count | total number of listings by host |
| availability_365 | number of days in the year the listing is available for rent |
| city | region of the listing |
| price | price of listing per night |
HomeSweetHome.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
'number_of_reviews', 'last_review', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365', 'city', 'price'],
dtype='object')
HomeSweetHome.describe(include='all')
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 137023.000000 | 137007 | 1.370230e+05 | 137002 | 80300 | 137023 | 137023.000000 | 137023.000000 | 137023 | 137023.000000 | 137023.000000 | 107232 | 107232.000000 | 137023.000000 | 137023.000000 | 137023 | 137023.000000 |
| unique | NaN | 132769 | NaN | 21866 | 17 | 1134 | NaN | NaN | 4 | NaN | NaN | 2271 | NaN | NaN | NaN | 19 | NaN |
| top | NaN | A place of your own \| 2BR in Las Vegas | NaN | Michael | Manhattan | Unincorporated Areas | NaN | NaN | Entire home/apt | NaN | NaN | 15/03/20 | NaN | NaN | NaN | New York City | NaN |
| freq | NaN | 44 | NaN | 1228 | 16096 | 5222 | NaN | NaN | 93651 | NaN | NaN | 1858 | NaN | NaN | NaN | 36519 | NaN |
| mean | 85580.723397 | NaN | 9.632567e+07 | NaN | NaN | NaN | 34.584207 | -101.510148 | NaN | 10.446071 | 33.138831 | NaN | 1.387220 | 17.226174 | 163.602446 | NaN | 205.281792 |
| std | 49471.662411 | NaN | 1.007056e+08 | NaN | NaN | NaN | 7.063332 | 28.013884 | NaN | 25.593436 | 61.903532 | NaN | 1.650963 | 52.670175 | 140.766748 | NaN | 504.573579 |
| min | 1.000000 | NaN | 2.300000e+01 | NaN | NaN | NaN | 18.920990 | -159.714900 | NaN | 1.000000 | 0.000000 | NaN | 0.010000 | 1.000000 | 0.000000 | NaN | 0.000000 |
| 25% | 42729.500000 | NaN | 1.431064e+07 | NaN | NaN | NaN | 30.249505 | -118.365710 | NaN | 1.000000 | 1.000000 | NaN | 0.220000 | 1.000000 | 1.000000 | NaN | 75.000000 |
| 50% | 85640.000000 | NaN | 5.274681e+07 | NaN | NaN | NaN | 36.058990 | -90.105520 | NaN | 3.000000 | 7.000000 | NaN | 0.780000 | 2.000000 | 151.000000 | NaN | 120.000000 |
| 75% | 128374.500000 | NaN | 1.543914e+08 | NaN | NaN | NaN | 40.718400 | -73.989320 | NaN | 7.000000 | 37.000000 | NaN | 2.000000 | 7.000000 | 316.000000 | NaN | 200.000000 |
| max | 171279.000000 | NaN | 3.679071e+08 | NaN | NaN | NaN | 45.617270 | -70.995950 | NaN | 1250.000000 | 966.000000 | NaN | 44.060000 | 593.000000 | 365.000000 | NaN | 24999.000000 |
Observations:
minimum_nights ranges from 1 to 1250 nights.
price ranges from 0 to 24999. A price of 0 is not a valid nightly rate, and the maximum is far above the 75th percentile.
25% of listings are priced at 75 or less.
50% of listings are priced at 120 or less; the mean (205.28) is well above this median, so price is right-skewed.
75% of listings are priced at 200 or less.
HomeSweetHome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   id                              137023 non-null  int64
 1   name                            137007 non-null  object
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object
 4   neighbourhood_group             80300 non-null   object
 5   neighbourhood                   137023 non-null  object
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object
 9   minimum_nights                  137023 non-null  float64
 10  number_of_reviews               137023 non-null  float64
 11  last_review                     107232 non-null  object
 12  reviews_per_month               107232 non-null  float64
 13  calculated_host_listings_count  137023 non-null  float64
 14  availability_365                137023 non-null  float64
 15  city                            137023 non-null  object
 16  price                           137023 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 17.8+ MB
Observations:
Of the 17 columns, 1 is int64 (id), 7 are object type (name, host_name, neighbourhood_group, neighbourhood, room_type, last_review, city), and the remaining 9 are float64.
Some columns may need converting before training: the count-like columns (minimum_nights, number_of_reviews, calculated_host_listings_count, availability_365) and price are stored as floats although they hold whole numbers, and last_review is stored as a string although it holds a date.
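As one possible conversion, last_review could be parsed into a proper datetime. The sketch below assumes the strings are day/month/two-digit-year, which sample values such as 18/09/19 suggest; the small frame is made up for illustration and is not the project data.

```python
import pandas as pd

# Toy frame mimicking the last_review column (two values from the preview, one missing)
sample = pd.DataFrame({"last_review": ["18/09/19", "07/09/20", None]})

# Assumed format: day/month/two-digit year; anything unparseable becomes NaT
sample["last_review"] = pd.to_datetime(
    sample["last_review"], format="%d/%m/%y", errors="coerce"
)

print(sample.dtypes)
print(sample)
```

Passing an explicit format avoids pandas guessing month-first for ambiguous dates such as 07/09/20.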
# profile = ProfileReport(HomeSweetHome, title="Rental Profiling Report")
# profile.to_file("Rental_report.html")
# print('Accomplished!')
| Observations | Values |
|---|---|
| Number of columns | 17 |
| Number of rows | 137023 |
| Missing cells | 116342 |
| Duplicate rows | 0 |
| Continuous type columns | 10 |
| Categorical type columns | 7 |
| Column | Missing values |
|---|---|
| name | 16 |
| host_name | 21 |
| neighbourhood_group | 56723 |
| last_review | 29791 |
| reviews_per_month | 29791 |
| Column | Distinct values |
|---|---|
| neighbourhood_group | 17 |
| room_type | 4 |
| city | 19 |
Performing Operations
HomeSweetHome.isna().sum()
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group               56723
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
minimum_nights                        0
number_of_reviews                     0
last_review                       29791
reviews_per_month                 29791
calculated_host_listings_count        0
availability_365                      0
city                                  0
price                                 0
dtype: int64
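Raw counts are easier to judge as a share of all rows. A minimal sketch, using a made-up frame for the pandas idiom and the counts reported above for the arithmetic:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for the real data
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", np.nan, np.nan, "Queens"],
    "reviews_per_month": [1.04, np.nan, 0.37, 3.82],
})

# isna().mean() is the fraction of missing entries per column
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)

# Same arithmetic on the reported count for neighbourhood_group
share = 56723 / 137023
print(f"{share:.1%}")
```

On the real frame the equivalent call would be HomeSweetHome.isna().mean().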
# Let's inspect the rows with null values in 'reviews_per_month'
HomeSweetHome[HomeSweetHome['reviews_per_month'].isna()]
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 61301 | Kukui’ula Club Villa 11 | 198477445.0 | Lodge | Kauai | Koloa-Poipu | 21.88471 | -159.48359 | Entire home/apt | 1.0 | 0.0 | NaN | NaN | 19.0 | 358.0 | Hawaii | 2399.0 |
| 8 | 47667 | Beautiful, unique 3 bedroom, 3.5 bath home loc... | 225117269.0 | Hank | NaN | Highland | 39.75848 | -105.01377 | Entire home/apt | 3.0 | 0.0 | NaN | NaN | 1.0 | 336.0 | Denver | 299.0 |
| 11 | 159356 | Cozy Room at Roosvelt Island - Full Bed | 16909509.0 | Lu | Manhattan | Roosevelt Island | 40.76440 | -73.94666 | Private room | 5.0 | 0.0 | NaN | NaN | 1.0 | 0.0 | New York City | 70.0 |
| 13 | 88622 | Venue/ Hall / Party | 85868649.0 | Event Space | Other Cities | Glendale | 34.14309 | -118.26325 | Entire home/apt | 1.0 | 0.0 | NaN | NaN | 2.0 | 364.0 | Los Angeles | 700.0 |
| 16 | 159582 | OVERSIZED SUN-FLOODED STUDIO IN SUGAR HILL | 24583865.0 | Rebecca | Manhattan | Harlem | 40.82749 | -73.94484 | Entire home/apt | 30.0 | 0.0 | NaN | NaN | 1.0 | 0.0 | New York City | 95.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 136999 | 130608 | Private 2bdr apt. awesome LES/Chinatown location | 6917811.0 | Gia | Manhattan | Civic Center | 40.71537 | -74.00181 | Private room | 2.0 | 0.0 | NaN | NaN | 1.0 | 0.0 | New York City | 130.0 |
| 137008 | 67221 | Christmas In Maui! | 18397245.0 | Sherri | Maui | Kihei-Makena | 20.68676 | -156.43680 | Entire home/apt | 4.0 | 0.0 | NaN | NaN | 1.0 | 0.0 | Hawaii | 550.0 |
| 137010 | 16023 | The Red Cottage | 10907013.0 | Samantha | NaN | Fort Lauderdale | 26.11271 | -80.16019 | Private room | 1.0 | 0.0 | NaN | NaN | 1.0 | 365.0 | Broward County | 150.0 |
| 137017 | 110268 | Music City Villa in East Nashville sleeps 10 | 35100052.0 | Victor | NaN | District 7 | 36.20469 | -86.73784 | Entire home/apt | 2.0 | 0.0 | NaN | NaN | 4.0 | 337.0 | Nashville | 259.0 |
| 137019 | 103694 | FANTASTIC 2BR/2BA! POOL, CLOSE TO ATTRACTIONS | 174792040.0 | RoomPicks By Victoria | City of Los Angeles | Downtown | 34.04472 | -118.25665 | Private room | 1.0 | 0.0 | NaN | NaN | 14.0 | 91.0 | Los Angeles | 464.0 |
29791 rows × 17 columns
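Every row in this preview has number_of_reviews equal to 0, which suggests that a missing reviews_per_month simply means the listing has no reviews yet. A check along the following lines could confirm that before imputing with 0; the frame here is illustrative, not the actual data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "number_of_reviews": [17, 0, 4, 0],
    "reviews_per_month": [1.04, np.nan, 0.37, np.nan],
})

missing = toy["reviews_per_month"].isna()

# True only if every listing with a missing rate also has zero reviews
all_zero = bool((toy.loc[missing, "number_of_reviews"] == 0).all())
print(all_zero)
```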
HomeSweetHome[HomeSweetHome['neighbourhood_group'].isna()]['city'].value_counts()
Broward County    8773
Austin            8432
Clark County      6718
New Orleans       5117
Chicago           5052
Nashville         4892
Portland          3448
Denver            3347
Boston            2663
Oakland           2555
Jersey City       2000
Asheville         1641
Columbus          1112
Cambridge          828
Pacific Grove      145
Name: city, dtype: int64
HomeSweetHome["neighbourhood_group"].value_counts(dropna=False)
NaN                     56723
Manhattan               16096
Brooklyn                14615
City of Los Angeles     14092
Other Cities             9126
Maui                     6361
Honolulu                 5025
Queens                   4584
Hawaii                   3990
Kauai                    2588
Unincorporated Areas     2035
Bronx                     962
Newport                   264
Staten Island             262
Washington                139
Providence                122
Kent                       22
Bristol                    17
Name: neighbourhood_group, dtype: int64
Observation:
HomeSweetHome["reviews_per_month"] = HomeSweetHome["reviews_per_month"].fillna(0) # Assign back instead of using inplace on a column selection
HomeSweetHome["neighbourhood_group"]=np.where(HomeSweetHome["neighbourhood_group"].isnull(),HomeSweetHome['city'],HomeSweetHome["neighbourhood_group"])
print(HomeSweetHome.shape)
(137023, 17)
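The city fallback for neighbourhood_group can also be expressed with fillna, which accepts another Series and aligns it on the index. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", np.nan, "Kauai", np.nan],
    "city": ["New York City", "Denver", "Hawaii", "Austin"],
})

# Each missing group takes the city of the same row; existing values are untouched
toy["neighbourhood_group"] = toy["neighbourhood_group"].fillna(toy["city"])
print(toy)
```

The result is the same as the np.where version; the choice between them is a matter of taste.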
HomeSweetHome[HomeSweetHome.duplicated()]
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HomeSweetHome.city.value_counts()
New York City     36519
Los Angeles       25253
Hawaii            17964
Broward County     8773
Austin             8432
Clark County       6718
New Orleans        5117
Chicago            5052
Nashville          4892
Portland           3448
Denver             3347
Boston             2663
Oakland            2555
Jersey City        2000
Asheville          1641
Columbus           1112
Cambridge           828
Rhode Island        564
Pacific Grove       145
Name: city, dtype: int64
# to check distribution of 'neighbourhood_group'
px.histogram(HomeSweetHome, x= 'neighbourhood_group', marginal='box',
nbins=47, title='Distribution of neighbourhood_group')
# to check distribution of 'room_type'
px.histogram(HomeSweetHome, x= 'room_type', marginal='box',
nbins=47, title='Distribution of room_type')
# to check distribution of 'minimum_nights'
plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['minimum_nights'])
<AxesSubplot:xlabel='minimum_nights', ylabel='Density'>
# to check distribution of 'number_of_reviews'
plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['number_of_reviews'])
<AxesSubplot:xlabel='number_of_reviews', ylabel='Density'>
# to check distribution of 'calculated_host_listings_count'
plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['calculated_host_listings_count'])
<AxesSubplot:xlabel='calculated_host_listings_count', ylabel='Density'>
# to check distribution of 'availability_365'
plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['availability_365'])
<AxesSubplot:xlabel='availability_365', ylabel='Density'>
px.histogram(HomeSweetHome, x= 'availability_365', marginal='box',
nbins=47, title='Distribution of availability_365')
px.histogram(HomeSweetHome, x= 'price', marginal='box',
nbins=47, title='Distribution of price')
px.histogram(HomeSweetHome[HomeSweetHome['price']<1000], x= 'price', marginal='box',
nbins=47, title='Distribution of price')
Observation:
px.scatter(HomeSweetHome,x='number_of_reviews',y='reviews_per_month')
px.scatter(HomeSweetHome,x='calculated_host_listings_count',y='price')
px.scatter(HomeSweetHome,y='calculated_host_listings_count',x='availability_365')
px.scatter(HomeSweetHome,x='reviews_per_month',y='price')
plt.figure(figsize=(6,4))
sns.barplot(data=HomeSweetHome,x='room_type',y='price')
<AxesSubplot:xlabel='room_type', ylabel='price'>
plt.figure(figsize=(8,8))
sns.heatmap(HomeSweetHome.select_dtypes(include='number').corr(),annot=True,cmap='icefire') # Correlate numeric columns only
<AxesSubplot:>
HomeSweetHome.select_dtypes(include='number').skew() # Skewness is defined for numeric columns only
id                                 0.001497
host_id                            1.051471
latitude                          -0.725820
longitude                         -0.740660
minimum_nights                    15.694661
number_of_reviews                  3.702934
reviews_per_month                  2.677926
calculated_host_listings_count     6.388123
availability_365                   0.182854
price                             20.790929
dtype: float64
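A large positive skew, such as the values for price and minimum_nights, is commonly reduced with a log transform. The sketch below illustrates the effect on synthetic right-skewed numbers; the lognormal parameters are arbitrary assumptions, not fitted to this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a column such as price
values = pd.Series(rng.lognormal(mean=4.8, sigma=0.8, size=5000))

skew_raw = values.skew()
skew_log = np.log1p(values).skew()  # log1p also handles zeros safely

print("raw skew:", skew_raw)
print("log skew:", skew_log)
```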
Observation:
Categorical & Continuous Variable Split
HomeSweetHome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   id                              137023 non-null  int64
 1   name                            137007 non-null  object
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object
 4   neighbourhood_group             137023 non-null  object
 5   neighbourhood                   137023 non-null  object
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object
 9   minimum_nights                  137023 non-null  float64
 10  number_of_reviews               137023 non-null  float64
 11  last_review                     107232 non-null  object
 12  reviews_per_month               137023 non-null  float64
 13  calculated_host_listings_count  137023 non-null  float64
 14  availability_365                137023 non-null  float64
 15  city                            137023 non-null  object
 16  price                           137023 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 17.8+ MB
HomeSweetHome_copy=HomeSweetHome.copy()
HomeSweetHome_copy[['minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count',
'availability_365','price']]=HomeSweetHome_copy[['minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count',
'availability_365','price']].astype('int')
HomeSweetHome_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   id                              137023 non-null  int64
 1   name                            137007 non-null  object
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object
 4   neighbourhood_group             137023 non-null  object
 5   neighbourhood                   137023 non-null  object
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object
 9   minimum_nights                  137023 non-null  int32
 10  number_of_reviews               137023 non-null  int32
 11  last_review                     107232 non-null  object
 12  reviews_per_month               137023 non-null  int32
 13  calculated_host_listings_count  137023 non-null  int32
 14  availability_365                137023 non-null  int32
 15  city                            137023 non-null  object
 16  price                           137023 non-null  int32
dtypes: float64(3), int32(6), int64(1), object(7)
memory usage: 14.6+ MB
# Keep only listings with a non-zero price
HomeSweetHome_copy=HomeSweetHome_copy[HomeSweetHome_copy['price']>0]
HomeSweetHome_copy.shape
(136981, 17)
HomeSweetHome_copy=HomeSweetHome_copy[HomeSweetHome_copy['price']<400]
HomeSweetHome_copy.shape
(125699, 17)
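The notebook does not say how the cutoff of 400 was chosen. One common rule of thumb, Tukey's upper fence (Q3 + 1.5 × IQR), lands close to it when applied to the price quartiles reported by describe() earlier. The arithmetic below is an illustration of that rule, not necessarily the reasoning behind the threshold.

```python
# Price quartiles taken from the describe() output
q1, q3 = 75.0, 200.0

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print("upper fence:", upper_fence)

# Rows removed by the price < 400 filter, from the two shapes printed above
dropped = 136981 - 125699
print("rows dropped:", dropped, f"({dropped / 136981:.1%})")
```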
df_cont = HomeSweetHome_copy.select_dtypes(exclude='object')
df_cat = HomeSweetHome_copy.select_dtypes(include='object')
df_cat.head()
| | name | host_name | neighbourhood_group | neighbourhood | room_type | last_review | city |
|---|---|---|---|---|---|---|---|
| 0 | Private bedroom located in Downtown Manhattan | Sandra And Katharina | Manhattan | Chinatown | Private room | 18/09/19 | New York City |
| 1 | Quiet, Comfy West LA Cottage - HSR 19-000047 | James | City of Los Angeles | Mar Vista | Entire home/apt | 07/09/20 | Los Angeles |
| 2 | Home away from Home ! | Kahee | Other Cities | Pasadena | Entire home/apt | 29/03/20 | Los Angeles |
| 4 | One bedroom apartment | Kostas | Queens | Astoria | Entire home/apt | 31/08/20 | New York City |
| 5 | 10th St Renovated Beauty- Rm 3U | Scott | Oakland | Prescott | Private room | 31/05/19 | Oakland |
df_cont.head()
| | id | host_id | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149653 | 257599351.0 | 40.71703 | -73.99538 | 2 | 17 | 1 | 1 | 0 | 100 |
| 1 | 74702 | 2882551.0 | 34.01257 | -118.44254 | 2 | 331 | 4 | 1 | 170 | 102 |
| 2 | 95858 | 287662307.0 | 34.14255 | -118.09888 | 10 | 4 | 0 | 1 | 306 | 131 |
| 4 | 132101 | 87835557.0 | 40.76623 | -73.90911 | 2 | 189 | 3 | 1 | 331 | 76 |
| 5 | 163184 | 1286670.0 | 37.81055 | -122.29792 | 7 | 16 | 0 | 5 | 0 | 59 |
Dropping the unnecessary features
# 'name', 'host_name', 'neighbourhood', 'last_review' can be dropped from Categorical df
df_cat=df_cat.drop(['name', 'host_name', 'neighbourhood', 'last_review'], axis=1)
# 'id','host_id','reviews_per_month' can be dropped from Continuous df
df_cont=df_cont.drop(['id','host_id','reviews_per_month'],axis=1)
df_cont.head()
| | latitude | longitude | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | price |
|---|---|---|---|---|---|---|---|
| 0 | 40.71703 | -73.99538 | 2 | 17 | 1 | 0 | 100 |
| 1 | 34.01257 | -118.44254 | 2 | 331 | 1 | 170 | 102 |
| 2 | 34.14255 | -118.09888 | 10 | 4 | 1 | 306 | 131 |
| 4 | 40.76623 | -73.90911 | 2 | 189 | 1 | 331 | 76 |
| 5 | 37.81055 | -122.29792 | 7 | 16 | 5 | 0 | 59 |
df_cat.head()
| | neighbourhood_group | room_type | city |
|---|---|---|---|
| 0 | Manhattan | Private room | New York City |
| 1 | City of Los Angeles | Entire home/apt | Los Angeles |
| 2 | Other Cities | Entire home/apt | Los Angeles |
| 4 | Queens | Entire home/apt | New York City |
| 5 | Oakland | Private room | Oakland |
df_cat = df_cat.apply(LabelEncoder().fit_transform)
df_cat.head()
| | neighbourhood_group | room_type | city |
|---|---|---|---|
| 0 | 18 | 2 | 14 |
| 1 | 9 | 0 | 11 |
| 2 | 24 | 0 | 11 |
| 4 | 28 | 0 | 14 |
| 5 | 23 | 2 | 15 |
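LabelEncoder assigns integer codes according to the sorted order of the distinct values it sees during fitting. The sketch below shows this on three made-up room types (the real column has four) and keeps the fitted encoder so the mapping can be inspected or reversed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values only; not the full set of room types in the data
room_type = pd.Series(["Private room", "Entire home/apt", "Entire home/apt", "Shared room"])

encoder = LabelEncoder()
codes = encoder.fit_transform(room_type)

print(list(encoder.classes_))                      # distinct values, sorted
print(codes.tolist())                              # position of each value in classes_
print(encoder.inverse_transform(codes).tolist())   # back to the original strings
```

Because the codes are just positions in a sorted list, their numeric order carries no meaning. A linear model will still treat them as ordered quantities, so one-hot encoding is often considered as an alternative.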
Combining both Categorical & Continuous df
df_comb=pd.concat([df_cat, df_cont], axis = 1)
df_comb.head()
| | neighbourhood_group | room_type | city | latitude | longitude | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 2 | 14 | 40.71703 | -73.99538 | 2 | 17 | 1 | 0 | 100 |
| 1 | 9 | 0 | 11 | 34.01257 | -118.44254 | 2 | 331 | 1 | 170 | 102 |
| 2 | 24 | 0 | 11 | 34.14255 | -118.09888 | 10 | 4 | 1 | 306 | 131 |
| 4 | 28 | 0 | 14 | 40.76623 | -73.90911 | 2 | 189 | 1 | 331 | 76 |
| 5 | 23 | 2 | 15 | 37.81055 | -122.29792 | 7 | 16 | 5 | 0 | 59 |
df_comb.skew()
neighbourhood_group                0.229196
room_type                          0.788885
city                              -0.685629
latitude                          -0.786414
longitude                         -0.775323
minimum_nights                    15.289802
number_of_reviews                  3.601265
calculated_host_listings_count     6.702582
availability_365                   0.222720
price                              1.081080
dtype: float64
def seperate_Xy(data=None):
    X = data.drop(labels=['price'], axis=1)
    y = data['price']
    return X, y
X, y = seperate_Xy(data=df_comb)
X.head()
| | neighbourhood_group | room_type | city | latitude | longitude | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 2 | 14 | 40.71703 | -73.99538 | 2 | 17 | 1 | 0 |
| 1 | 9 | 0 | 11 | 34.01257 | -118.44254 | 2 | 331 | 1 | 170 |
| 2 | 24 | 0 | 11 | 34.14255 | -118.09888 | 10 | 4 | 1 | 306 |
| 4 | 28 | 0 | 14 | 40.76623 | -73.90911 | 2 | 189 | 1 | 331 |
| 5 | 23 | 2 | 15 | 37.81055 | -122.29792 | 7 | 16 | 5 | 0 |
y.head()
0    100
1    102
2    131
4     76
5     59
Name: price, dtype: int32
scaler = StandardScaler()
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X), index = X.index, columns = X.columns + '_S')
X.head()
| | neighbourhood_group_S | room_type_S | city_S | latitude_S | longitude_S | minimum_nights_S | number_of_reviews_S | calculated_host_listings_count_S | availability_365_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.582539 | 1.354976 | 0.913594 | 0.838742 | 0.963337 | -0.338781 | -0.280974 | -0.294627 | -1.140292 |
| 1 | -0.543619 | -0.701633 | 0.236777 | -0.124875 | -0.648639 | -0.338781 | 4.648905 | -0.294627 | 0.069343 |
| 2 | 1.333311 | -0.701633 | 0.236777 | -0.106193 | -0.636175 | -0.022619 | -0.485077 | -0.294627 | 1.037052 |
| 4 | 1.833826 | -0.701633 | 0.913594 | 0.845813 | 0.966466 | -0.338781 | 2.419469 | -0.294627 | 1.214939 |
| 5 | 1.208182 | 1.354976 | 1.139199 | 0.421000 | -0.788463 | -0.141180 | -0.296674 | -0.218766 | -1.140292 |
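StandardScaler replaces each value x with (x − mean) / std, column by column, using the population standard deviation. A small check on made-up numbers shows that the manual formula and the scaler agree:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names borrowed from the dataset for readability
toy = pd.DataFrame({
    "minimum_nights": [2, 2, 10, 2, 7],
    "availability_365": [0, 170, 306, 331, 0],
})

scaled = StandardScaler().fit_transform(toy)

# Manual z-score; ddof=0 matches the population standard deviation used by StandardScaler
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Standardising shifts and rescales each column but does not change its shape, which is why the skew values are unchanged after scaling.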
y = np.log(y)
y.head()
0    4.605170
1    4.624973
2    4.875197
4    4.330733
5    4.077537
Name: price, dtype: float64
X.skew()
neighbourhood_group_S                0.229196
room_type_S                          0.788885
city_S                              -0.685629
latitude_S                          -0.786414
longitude_S                         -0.775323
minimum_nights_S                    15.289802
number_of_reviews_S                  3.601265
calculated_host_listings_count_S     6.702582
availability_365_S                   0.222720
dtype: float64
def Xy_splitter(X=None, y=None):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
    print('Training Data Shape:', X_train.shape, y_train.shape)
    print('Testing Data Shape:', X_test.shape, y_test.shape)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = Xy_splitter(X=X, y=y)
Training Data Shape: (94274, 9) (94274,)
Testing Data Shape: (31425, 9) (31425,)
def model_generator_lr():
    return LinearRegression()
clf = model_generator_lr()
def train_n_eval(clf=None):
    # Extract the model name
    model_name = type(clf).__name__
    # Fit the model on the training data
    clf.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = clf.predict(X_test)
    # Make predictions on the training data
    y_pred_train = clf.predict(X_train)
    # Mean absolute error on test and train
    clf_mae = mean_absolute_error(y_test, y_pred)
    clf_mae_train = mean_absolute_error(y_train, y_pred_train)
    # Mean squared error on test and train
    clf_mse = mean_squared_error(y_test, y_pred)
    clf_mse_train = mean_squared_error(y_train, y_pred_train)
    # R-squared on test and train
    clf_r2 = r2_score(y_test, y_pred)
    clf_r2_train = r2_score(y_train, y_pred_train)
    # Display the metrics (all computed on log(price), since y was log-transformed)
    print('Performance Metrics for', model_name, ':\n')
    print('[Mean Absolute Error Train]:', clf_mae_train)
    print('[Mean Absolute Error Test]:', clf_mae, '\n')
    print('*******************************\n')
    print('[Mean Squared Error Train]:', clf_mse_train)
    print('[Mean Squared Error Test]:', clf_mse, '\n')
    print('*******************************\n')
    print('[Root Mean Squared Error Train]:', np.sqrt(clf_mse_train))
    print('[Root Mean Squared Error Test]:', np.sqrt(clf_mse), '\n')
    print('*******************************\n')
    print('[R2-Score Train]:', clf_r2_train)
    print('[R2-Score Test]:', clf_r2)
train_n_eval(clf=clf)
Performance Metrics for LinearRegression :

[Mean Absolute Error Train]: 0.40754035769661023
[Mean Absolute Error Test]: 0.4048206163822222

*******************************

[Mean Squared Error Train]: 0.25813454531286567
[Mean Squared Error Test]: 0.25564061718755987

*******************************

[Root Mean Squared Error Train]: 0.5080694296184978
[Root Mean Squared Error Test]: 0.5056091545725412

*******************************

[R2-Score Train]: 0.35258113777827904
[R2-Score Test]: 0.3548097878310106
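Because the target was log-transformed before training, these errors are in log(price) units rather than in currency. An additive error of e on the log scale corresponds to a multiplicative factor of exp(e) on the price itself. The short sketch below converts the reported test-set values as a rough reading aid:

```python
import numpy as np

# Test-set values reported above, measured on log(price)
rmse_log = 0.5056091545725412
mae_log = 0.4048206163822222

# Convert additive log-scale errors into multiplicative factors on price
rmse_factor = np.exp(rmse_log)
mae_factor = np.exp(mae_log)

print("RMSE as a price factor:", round(rmse_factor, 2))
print("MAE as a price factor:", round(mae_factor, 2))
```

Read this way, a typical prediction is off by a factor of about one and a half in either direction, which is consistent with the modest R2 of roughly 0.35.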
Observation:
df_test=pd.read_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/test_data.csv')
df_test.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49653 | Beachfront Bungalow & Home of Honu | 24843776.0 | Marco | Maui | Lahaina | 20.95924 | -156.68334 | Entire home/apt | 3.0 | 79.0 | 25/03/20 | 1.18 | 1.0 | 275.0 | Hawaii |
| 1 | 128272 | Resort-like living in Williamsburg | 14461742.0 | Mohammed | Brooklyn | Williamsburg | 40.71552 | -73.93869 | Entire home/apt | 5.0 | 1.0 | 01/01/16 | 0.02 | 1.0 | 0.0 | New York City |
| 2 | 88753 | Los Angeles Luxury Apartment in Downtown LA | 205959517.0 | Christian | City of Los Angeles | Downtown | 34.04331 | -118.25804 | Entire home/apt | 30.0 | 0.0 | NaN | NaN | 1.0 | 354.0 | Los Angeles |
| 3 | 151475 | E Z Living In Harlem 2 | 244536777.0 | Chester | Manhattan | Harlem | 40.81615 | -73.94359 | Entire home/apt | 3.0 | 4.0 | 01/12/19 | 0.30 | 2.0 | 180.0 | New York City |
| 4 | 134525 | Cozy, neat, spacious 2 BR apartment in East Ha... | 2247818.0 | Gonda | Manhattan | East Harlem | 40.80268 | -73.94051 | Entire home/apt | 10.0 | 33.0 | 13/03/20 | 0.77 | 1.0 | 10.0 | New York City |
df_test["reviews_per_month"] = df_test["reviews_per_month"].fillna(0) # Assign back instead of using inplace on a column selection
df_test["neighbourhood_group"]=np.where(df_test["neighbourhood_group"].isnull(),df_test['city'],df_test["neighbourhood_group"])
df_test[['minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count',
'availability_365']]=df_test[['minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count',
'availability_365']].astype('int')
df_test_cont = df_test.select_dtypes(exclude='object')
df_test_cat = df_test.select_dtypes(include='object')
df_test_cat=df_test_cat.drop(['name', 'host_name', 'neighbourhood', 'last_review'], axis=1)
df_test_cont=df_test_cont.drop(['id','host_id','reviews_per_month'],axis=1)
df_test_cat = df_test_cat.apply(LabelEncoder().fit_transform)
df_test_comb=pd.concat([df_test_cat, df_test_cont], axis = 1)
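# Reuse the scaler fitted on the training features; refitting on the test set would standardise it with different means and variances
df_test_comb = pd.DataFrame(scaler.transform(df_test_comb), index = df_test_comb.index, columns = df_test_comb.columns + '_S')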
df_test.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49653 | Beachfront Bungalow & Home of Honu | 24843776.0 | Marco | Maui | Lahaina | 20.95924 | -156.68334 | Entire home/apt | 3 | 79 | 25/03/20 | 1 | 1 | 275 | Hawaii |
| 1 | 128272 | Resort-like living in Williamsburg | 14461742.0 | Mohammed | Brooklyn | Williamsburg | 40.71552 | -73.93869 | Entire home/apt | 5 | 1 | 01/01/16 | 0 | 1 | 0 | New York City |
| 2 | 88753 | Los Angeles Luxury Apartment in Downtown LA | 205959517.0 | Christian | City of Los Angeles | Downtown | 34.04331 | -118.25804 | Entire home/apt | 30 | 0 | NaN | 0 | 1 | 354 | Los Angeles |
| 3 | 151475 | E Z Living In Harlem 2 | 244536777.0 | Chester | Manhattan | Harlem | 40.81615 | -73.94359 | Entire home/apt | 3 | 4 | 01/12/19 | 0 | 2 | 180 | New York City |
| 4 | 134525 | Cozy, neat, spacious 2 BR apartment in East Ha... | 2247818.0 | Gonda | Manhattan | East Harlem | 40.80268 | -73.94051 | Entire home/apt | 10 | 33 | 13/03/20 | 0 | 1 | 10 | New York City |
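Encoders and scalers are generally fitted once on the training data and then only applied to new data, so that a given category or value maps to the same number in both sets. The sketch below shows that pattern on made-up frames. It uses OrdinalEncoder, a swapped-in alternative to the per-column LabelEncoder in this notebook, because it can assign a fixed code to categories that were never seen during fitting; the frames and the names demo_encoder and demo_scaler are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up train and test frames; "Boston" does not occur in the training data
train = pd.DataFrame({"city": ["Austin", "Denver", "Hawaii", "Denver"],
                      "minimum_nights": [2, 3, 1, 30]})
test = pd.DataFrame({"city": ["Hawaii", "Boston"],
                     "minimum_nights": [5, 2]})

# Fit on the training data only; unseen categories are mapped to -1
demo_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
demo_scaler = StandardScaler()

train_cat = demo_encoder.fit_transform(train[["city"]])
train_num = demo_scaler.fit_transform(train[["minimum_nights"]])

# Transform (do not refit) the test data so codes and scales match the training set
test_cat = demo_encoder.transform(test[["city"]])
test_num = demo_scaler.transform(test[["minimum_nights"]])

print(test_cat.ravel().tolist())
print(test_num.ravel().tolist())
```

Refitting on the test set instead would give "Hawaii" a different code than it had during training, and the model's coefficients would then be applied to mismatched inputs.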
# Make predictions using test data
y_test_pred = clf.predict(df_test_comb)
print(np.round(np.exp(y_test_pred),decimals=0))
[151. 133. 147. ... 71. 138. 161.]
index=df_test['id']
result=pd.DataFrame(np.round(np.exp(y_test_pred),decimals=0),index=index)
result.to_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/result.csv', header=False)
print("Successful")
Successful
result
| id | 0 |
|---|---|
| 49653 | 151.0 |
| 128272 | 133.0 |
| 88753 | 147.0 |
| 151475 | 142.0 |
| 134525 | 130.0 |
| ... | ... |
| 161801 | 65.0 |
| 28466 | 134.0 |
| 165580 | 71.0 |
| 40826 | 138.0 |
| 49668 | 161.0 |
34257 rows × 1 columns